Malware data item analysis

ABSTRACT

Embodiments of the present disclosure relate to a data analysis system that may automatically analyze a suspected malware file, or group of files. Automatic analysis of the suspected malware file(s) may include one or more automatic analysis techniques. Automatic analysis may include production and gathering of various items of information related to the suspected malware file(s) including, for example, calculated hashes, file properties, academic analysis information, file execution information, third-party analysis information, and/or the like. The analysis information may be automatically associated with the suspected malware file(s), and a user interface may be generated in which the various analysis information items are presented to a human analyst such that the analyst may quickly and efficiently evaluate the suspected malware file(s). For example, the analyst may quickly determine one or more characteristics of the suspected malware file(s), whether or not the file(s) is malware, and/or a threat level of the file(s).

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 14/668,833, filed Mar. 25, 2015, titled “MALWARE DATA ITEM ANALYSIS,” which is a continuation of U.S. patent application Ser. No. 14/473,860, filed Aug. 29, 2014, titled “MALWARE DATA ITEM ANALYSIS,” which claims the benefit of U.S. Provisional Patent Application No. 62/020,905, filed Jul. 3, 2014, titled “MALWARE DATA ITEM ANALYSIS.” The entire disclosure of each of the above items is hereby made part of this specification as if set forth fully herein and incorporated by reference for all purposes, for all that it contains.

Any and all applications for which a foreign or domestic priority claim is identified in the Application Data Sheet as filed with the present application are hereby incorporated by reference under 37 CFR 1.57.

BACKGROUND

Embodiments of the present disclosure generally relate to automatic analysis of data items, and specifically to automatic analysis of malware-related data items.

Malware may include any software program (and/or group of software programs) installed on a computer system and/or a network of computer systems maliciously and/or without authorization. When executed, an item of malware may take any number of undesirable actions including, for example, collection of private or sensitive information (for example, personal data and information, passwords and usernames, and the like), transmission of the collected information to another computing device, destruction or modification of data (for example, accessing, modifying, and/or deleting files), communication with other malware, transmission or replication of malware onto other connected computing devices or systems, transmission of data so as to attack another computing device or system (for example, a Distributed Denial of Service Attack), and/or hijacking of processing power, just to name a few.

SUMMARY

The systems, methods, and devices described herein each have several aspects, no single one of which is solely responsible for its desirable attributes. Without limiting the scope of this disclosure, several non-limiting features will now be discussed briefly.

Embodiments of the present disclosure relate to a data analysis system (also referred to herein as “the system”) that may automatically analyze a suspected malware file, or group of files. Automatic analysis of the suspected malware file(s) (also referred to herein as file data item(s)) may include one or more automatic analysis techniques. Automatic analysis of a file data item may include production and gathering of various items of information (also referred to herein as “analysis information data items” and/or “analysis information items”) related to the file data item including, for example, calculated hashes, file properties, academic analysis information, file execution information, third-party analysis information, and/or the like. The analysis information items may be automatically associated with the file data item, and a user interface may be generated in which the various analysis information items are presented to a human analyst such that the analyst may quickly and efficiently evaluate the file data item. For example, the analyst may quickly determine one or more characteristics of the file data item, whether or not the file data item is malware, and/or a threat level of the file data item.

In various embodiments, the system may receive suspected malware files from various users. The system may automatically analyze submitted file data items, associate the file data items with analysis information items, and/or store the file data item and analysis information items in one or more data stores. The system may generate a submission data item with each submission of a file data item, which submission data item may be associated with the submitted file data item. The system may automatically determine whether or not a particular submitted data item was previously submitted to the system and, if so, may associate a new submission data item with the previously submitted file data item. Further, in an embodiment, the system may not re-analyze a previously submitted file data item. Accordingly, in various embodiments, the system may associate a file data item with various submission data items such that information regarding, for example, a number of submissions and/or time of submission may be presented to the analyst. Additionally, information regarding users who submitted the suspected malware files may be associated with the submission file data items, and may be presented to the analyst in connection with the respective file data items.

In various embodiments, file data items and related information may be shared by the system with one or more third-party systems, and/or third-party systems may share file data items and related information with the system.

As described, some embodiments of the present disclosure relate to a system designed to provide interactive, graphical user interfaces (also referred to herein as “user interfaces”) for enabling an analyst to quickly and efficiently analyze and evaluate suspected malware data files. The user interfaces are interactive such that a user may make selections, provide inputs, and/or manipulate outputs. In response to various user inputs, the system automatically analyzes suspected malware data files, associates related malware data files, and provides outputs to the user, including user interfaces and various analysis information related to the analyzed malware data files. The outputs, including various user interfaces, may be automatically updated based on additional inputs provided by the user.

This application is related to the following U.S. patent applications:

Docket No.    Serial No.   Title                                                 Filed
PALAN.268A2   14/473920    EXTERNAL MALWARE DATA ITEM CLUSTERING AND ANALYSIS    Aug. 19, 2014
PALAN.236A    14/280490    SECURITY SHARING SYSTEM                               May 16, 2014

The entire disclosure of each of the above items is hereby made part of this specification as if set forth fully herein and incorporated by reference for all purposes, for all that it contains.

According to an embodiment, a computer system comprises: one or more computer readable storage devices configured to store: a plurality of computer executable instructions; and a plurality of file data items and submission data items, each submission data item associated with at least one file data item; and one or more hardware computer processors in communication with the one or more computer readable storage devices and configured to execute the plurality of computer executable instructions in order to cause the computer system to automatically: in response to receiving a new file data item: determine whether the received new file data item was previously received by comparing the received new file data item to the plurality of file data items; and generate a new submission data item; in response to determining that the new file data item was not previously received: initiate an analysis of the new file data item, wherein the analysis of the new file data item generates analysis information items, wherein initiating the analysis of the new file data item comprises: initiating an internal analysis of the new file data item including at least calculation of a hash of the file data item; and initiating an external analysis of the new file data item by one or more third party analysis systems; associate the analysis information items with the new file data item; associate the new submission data item with the new file data item; and generate a user interface including one or more user selectable portions presenting various of the analysis information items.

According to another embodiment, the one or more hardware computer processors are further configured to execute the plurality of computer executable instructions in order to cause the computer system to: in response to determining that the new file data item was previously received: determine a storage location of the file data item that was previously received; retrieve the analysis information items associated with the file data item that was previously received; associate the new submission data item with the file data item that was previously received; and generate a user interface including one or more user selectable portions presenting various of the analysis information items associated with the file data item that was previously received, the user interface usable by the analyst to determine one or more characteristics of the file data item that was previously received.

According to yet another embodiment, further in response to determining that the new file data item was previously received, the analyst is notified via the user interface that the new file data item was previously received.

According to another embodiment, determining whether the received new file data item was previously received comprises: calculating a hash of the received new file data item and comparing the calculated hash to previously calculated hashes associated with the plurality of file data items.

According to yet another embodiment, the one or more hardware computer processors are further configured to execute the plurality of computer executable instructions in order to cause the computer system to: in response to an analyst input selecting to view a graph of the new file data item, generating a graph including at least a first node representing the new file data item, a second node representing the new submission data item, and an edge connecting the first and second nodes and representing the association between the new file data item and the new submission data item.

According to another embodiment, the graph further includes additional nodes representing other file data items and/or submission data items associated with the new file data item, and additional edges connecting the additional nodes and the first node and representing associations between the other file data items and/or submission data items and the new file data item.
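
By way of non-limiting illustration, the node-and-edge graph described above may be sketched as follows. The class names `Node` and `AssociationGraph`, the identifiers such as `file-1`, and the use of undirected edges are assumptions made for this sketch only and are not taken from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    node_id: str
    kind: str  # "file" or "submission" (labels chosen for this sketch)


@dataclass
class AssociationGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: set = field(default_factory=set)    # each edge is frozenset({id_a, id_b})

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, id_a: str, id_b: str) -> None:
        # An edge represents an association between two existing data items.
        if id_a not in self.nodes or id_b not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.add(frozenset((id_a, id_b)))

    def neighbors(self, node_id: str) -> list:
        # Collect the other endpoint of every edge touching node_id.
        found = set()
        for edge in self.edges:
            if node_id in edge:
                found.update(edge - {node_id})
        return sorted(found)


# One file data item associated with two submission data items.
graph = AssociationGraph()
graph.add_node(Node("file-1", "file"))
graph.add_node(Node("submission-1", "submission"))
graph.add_node(Node("submission-2", "submission"))
graph.add_edge("file-1", "submission-1")
graph.add_edge("file-1", "submission-2")
```

In this sketch, `graph.neighbors("file-1")` lists every submission associated with the file, which is the information a graph view would render as edges from the first node.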

According to yet another embodiment, the internal analysis includes analysis performed by the one or more hardware computer processors, and wherein the internal analysis further includes at least one of calculation of an MD5 hash of the new file data item, calculation of a SHA-1 hash of the new file data item, calculation of a SHA-256 hash of the new file data item, calculation of an SSDeep hash of the new file data item, or calculation of a size of the new file data item.
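
As a minimal, non-limiting sketch of such an internal analysis, the following uses Python's standard `hashlib` module to calculate the MD5, SHA-1, and SHA-256 hashes and the size of a file's contents. The function name `internal_analysis` and the dictionary keys are assumptions made for illustration. SSDeep hashing is omitted because it is not available in the standard library.

```python
import hashlib


def internal_analysis(data: bytes) -> dict:
    """Return a few analysis information items for one file's raw bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
    }


# Example with an empty file and with the three-byte input b"abc".
empty_items = internal_analysis(b"")
abc_items = internal_analysis(b"abc")
```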

According to another embodiment, the external analysis includes analysis performed by at least a second computer system, and wherein the external analysis includes execution of the new file data item in a sandboxed environment and analysis of the new file data item by a third-party malware analysis service.

According to yet another embodiment, any payload provided by the new file data item after execution of the new file data item in the sandboxed environment is associated with the new file data item.

According to another embodiment, the one or more hardware computer processors are further configured to execute the plurality of computer executable instructions in order to cause the computer system to: in response to an analyst input, sharing the new file data item and associated analysis information items with a second computer system via a third computer system.

According to yet another embodiment, a computer-implemented method comprises: storing on one or more computer readable storage devices: a plurality of computer executable instructions; and a plurality of file data items and submission data items, each submission data item associated with at least one file data item; in response to receiving a new file data item: determining, by one or more hardware computer devices configured with specific computer executable instructions, whether the received new file data item was previously received by comparing the received new file data item to the plurality of file data items; and generating, by the one or more hardware computer devices, a new submission data item; and in response to determining that the new file data item was not previously received: initiating, by the one or more hardware computer devices, an analysis of the new file data item, wherein the analysis of the new file data item generates analysis information items, wherein the initiating analysis of the new file data item comprises initiating an internal analysis of the new file data item including at least calculation of a hash of the file data item; associating, by the one or more hardware computer devices, the analysis information items with the new file data item; associating, by the one or more hardware computer devices, the new submission data item with the new file data item; and generating, by the one or more hardware computer devices, a user interface including one or more user selectable portions presenting various of the analysis information items, the user interface usable by an analyst to determine one or more characteristics of the new file data item.

According to another embodiment, the method further comprises: in response to determining that the new file data item was previously received: determining, by the one or more hardware computer devices, a storage location of the file data item that was previously received; retrieving, by the one or more hardware computer devices, the analysis information items associated with the file data item that was previously received; associating, by the one or more hardware computer devices, the new submission data item with the file data item that was previously received; and generating, by the one or more hardware computer devices, a user interface including one or more user selectable portions presenting various of the analysis information items associated with the file data item that was previously received.

According to yet another embodiment, further in response to determining that the new file data item was previously received, the analyst is notified via the user interface that the new file data item was previously received.

According to another embodiment, the internal analysis includes analysis performed by the one or more hardware computer processors, and wherein the internal analysis further includes at least one of calculation of an MD5 hash of the new file data item, calculation of a SHA-1 hash of the new file data item, calculation of a SHA-256 hash of the new file data item, calculation of an SSDeep hash of the new file data item, or calculation of a size of the new file data item.

According to yet another embodiment, the external analysis includes analysis performed by at least a second computer system, and wherein the external analysis includes execution of the new file data item in a sandboxed environment and analysis of the new file data item by a third-party malware analysis service.

According to another embodiment, a non-transitory computer-readable storage medium is disclosed, the non-transitory computer-readable storage medium storing software instructions that, in response to execution by a computer system having one or more hardware processors, configure the computer system to perform operations comprising: storing on one or more computer readable storage devices: a plurality of computer executable instructions; and a plurality of file data items and submission data items, each submission data item associated with at least one file data item; in response to receiving a new file data item: determining whether the received new file data item was previously received by comparing the received new file data item to the plurality of file data items; and generating a new submission data item; and in response to determining that the new file data item was not previously received: initiating an analysis of the new file data item, wherein the analysis of the new file data item generates analysis information items, wherein the initiating analysis of the new file data item comprises initiating an internal analysis of the new file data item including at least calculation of a hash of the file data item; associating the analysis information items with the new file data item; associating the new submission data item with the new file data item; and generating a user interface including one or more user selectable portions presenting various of the analysis information items, the user interface usable by an analyst to determine one or more characteristics of the new file data item.

According to yet another embodiment, the software instructions further configure the computer system to perform operations comprising: in response to determining that the new file data item was previously received: determining a storage location of the file data item that was previously received; retrieving the analysis information items associated with the file data item that was previously received; associating the new submission data item with the file data item that was previously received; and generating a user interface including one or more user selectable portions presenting various of the analysis information items associated with the file data item that was previously received.

According to another embodiment, further in response to determining that the new file data item was previously received, the analyst is notified via the user interface that the new file data item was previously received.

According to yet another embodiment, the internal analysis includes analysis performed by the one or more hardware computer processors, and wherein the internal analysis further includes at least one of calculation of an MD5 hash of the new file data item, calculation of a SHA-1 hash of the new file data item, or calculation of a size of the new file data item.

According to another embodiment, the initiating analysis of the new file data item further comprises: initiating an external analysis of the new file data item, wherein the external analysis includes execution of the new file data item in a sandboxed environment and analysis of the new file data item by a third-party malware analysis service.

BRIEF DESCRIPTION OF THE DRAWINGS

The following drawings and the associated descriptions are provided to illustrate embodiments of the present disclosure and do not limit the scope of the claims. Aspects and many of the attendant advantages of this disclosure will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:

FIG. 1 is a flowchart of an example method of the data analysis system, according to an embodiment of the present disclosure.

FIGS. 2A-2H illustrate example user interfaces of the data analysis system, according to embodiments of the present disclosure.

FIG. 3A illustrates an embodiment of a database system using an ontology.

FIG. 3B illustrates an embodiment of a system for creating data in a data store using a dynamic ontology.

FIG. 4 illustrates a sample user interface using relationships described in a data store using a dynamic ontology.

FIG. 5 illustrates a computer system with which certain methods discussed herein may be implemented.

DETAILED DESCRIPTION

Although certain preferred embodiments and examples are disclosed below, inventive subject matter extends beyond the specifically disclosed embodiments to other alternative embodiments and/or uses and to modifications and equivalents thereof. Thus, the scope of the claims appended hereto is not limited by any of the particular embodiments described below. For example, in any method or process disclosed herein, the acts or operations of the method or process may be performed in any suitable sequence and are not necessarily limited to any particular disclosed sequence. Various operations may be described as multiple discrete operations in turn, in a manner that may be helpful in understanding certain embodiments; however, the order of description should not be construed to imply that these operations are order dependent. Additionally, the structures, systems, and/or devices described herein may be embodied as integrated components or as separate components. For purposes of comparing various embodiments, certain aspects and advantages of these embodiments are described. Not necessarily all such aspects or advantages are achieved by any particular embodiment. Thus, for example, various embodiments may be carried out in a manner that achieves or optimizes one advantage or group of advantages as taught herein without necessarily achieving other aspects or advantages as may also be taught or suggested herein.

Terms

In order to facilitate an understanding of the systems and methods discussed herein, a number of terms are defined below. The terms defined below, as well as other terms used herein, should be construed broadly to include, without limitation, the provided definitions, the ordinary and customary meanings of the terms, and/or any other implied meanings for the respective terms. Thus, the definitions below do not limit the meaning of these terms, but only provide example definitions.

Ontology: Stored information that provides a data model for storage of data in one or more databases. For example, the stored data may comprise definitions for object types and property types for data in a database, and how objects and properties may be related.

Database: A broad term for any data structure for storing and/or organizing data, including, but not limited to, relational databases (for example, Oracle database, MySQL database, and the like), spreadsheets, XML files, and text files, among others. The various terms “database,” “data store,” and “data source” may be used interchangeably in the present disclosure.

Data Item (Item), Data Object (Object), or Data Entity (Entity): A data container for information representing specific things in the world that have a number of definable properties. For example, a data item may represent an item such as a person, a place, an organization, an account, a computer, an activity, a market instrument, or other noun. A data item may represent an event that happens at a point in time or for a duration. A data item may represent a document or other unstructured data source such as an e-mail message, a news report, or a written paper or article. Each data item may be associated with a unique identifier that uniquely identifies the data item. The data item's attributes (for example, metadata about the data item) may be represented in one or more properties. The terms “data item,” “data object,” “data entity,” “item,” “object,” and “entity” may be used interchangeably and/or synonymously in the present disclosure.

Item (or Entity or Object) Type: Type of a data item (for example, Person, Event, or Document). Data item types may be defined by an ontology and may be modified or updated to include additional data item types. A data item definition (for example, in an ontology) may include how the data item is related to other data items, such as being a sub-data item type of another data item type (for example, an agent may be a sub-data item of a person data item type), and the properties the data item type may have.

Properties: Also referred to as “metadata,” includes attributes of a data item that represent individual data items. At a minimum, each property of a data item has a property type and a value or values. Properties/metadata associated with data items may include any information relevant to that object. For example, properties associated with a person data item may include a name (for example, John Doe), an address (for example, 123 S. Orange Street), and/or a phone number (for example, 800-0000), among other properties. In another example, metadata associated with a computer data item may include a list of users (for example, user1, user2, and the like), and/or an IP (internet protocol) address, among other properties.

Property Type: The type of data a property is, such as a string, an integer, or a double. Property types may include complex property types, such as a series of data values associated with timed ticks (for example, a time series), and the like.

Property Value: The value associated with a property, which is of the type indicated in the property type associated with the property. A property may have multiple values.

Link: A connection between two data objects, based on, for example, a relationship, an event, and/or matching properties. Links may be directional, such as one representing a payment from person A to B, or bidirectional.

Link Set: Set of multiple links that are shared between two or more data objects.
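
By way of non-limiting example, the terms defined above (data item, item type, property, property type, property value, and link) may be modeled as simple data classes. Every class name, field name, and the use of `uuid4` for the unique identifier below is an assumption made for this sketch and does not describe any particular ontology implementation.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Property:
    property_type: str  # for example "string", "integer", or "double"
    values: list        # a property may have multiple values


@dataclass
class DataItem:
    item_type: str                                  # for example "Person" or "Computer"
    properties: dict = field(default_factory=dict)  # property name -> Property
    item_id: str = field(default_factory=lambda: uuid4().hex)  # unique identifier

    def set_property(self, name: str, property_type: str, *values) -> None:
        self.properties[name] = Property(property_type, list(values))


@dataclass(frozen=True)
class Link:
    source_id: str
    target_id: str
    relationship: str
    directional: bool = False  # False models a bidirectional link


# A person data item and a computer data item, using the examples from the definitions.
person_a = DataItem("Person")
person_a.set_property("name", "string", "John Doe")
person_b = DataItem("Person")
computer = DataItem("Computer")
computer.set_property("users", "string", "user1", "user2")

# A directional link representing a payment from person A to person B.
payment = Link(person_a.item_id, person_b.item_id, "payment", directional=True)
```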

Overview

Embodiments of the present disclosure relate to a data analysis system (also referred to herein as “the system”) that may automatically analyze a suspected malware file, or group of files, and present analysis information to an analyst via a user interface. Malware files may include any software program file (and/or group of software program files) that may be installed on a computer system and/or a network of computer systems maliciously and/or without authorization. When executed, a malware file may take any number of undesirable actions including, for example, collection of private or sensitive information (for example, personal data and information, passwords and usernames, and the like), transmission of the collected information to another computing device, destruction or modification of data (for example, accessing, modifying, and/or deleting files), communication with other malware, transmission or replication of malware onto other connected computing devices or systems, transmission of data so as to attack another computing device or system (for example, a Distributed Denial of Service Attack), and/or hijacking of processing power, just to name a few. In most cases such malware infects a computing device via a network connection (for example, a connection to the Internet), and communicates with another computing device or system (for example, another Internet-connected computing device) to accomplish its purpose. Oftentimes malware is well hidden in the infected computing device such that it may not be detectable to an average user of the computing device.

Detection and removal of malware from infected computing devices and/or systems is a highly desirable, but oftentimes challenging task. Detection of malware is of particular importance to organizations (for example, businesses) that maintain internal networks of computing devices that may be connected to various external networks of computing devices (for example, the Internet) because infection of a single computing device of the internal network may quickly spread to other computing devices of the internal network and may result in significant data loss and/or financial consequences.

Detection of malware may be enabled by accurate and thorough information regarding the malware. Further, whether or not a particular file or program is an item of malware, and an accurate assessment of threat posed by the item of malware, may be enabled by such accurate and thorough information. Previously, determination and collection of information about a suspected malware file was a labor intensive task. For example, an analyst may have had to isolate the suspected malware file, manually run tests and analyses on the suspected malware file, and compile any information gleaned from such tests and analyses. The compiled information may be in varying formats and difficult to analyze. Further, a given suspected malware file may be analyzed multiple times in such a manual process, unbeknownst to the analyst (because, for example, the suspected malware file may have been found in multiple disjoint incidents).

Embodiments of the data analysis system described herein may overcome the limitations and deficiencies of previous methods of gathering information about suspected malware files. For example, an analyst may simply provide a suspected malware file, or group of files, to the data analysis system for automatic analysis and generation of a user interface by which the analyst may efficiently evaluate the analysis and the suspected malware file. In various embodiments, the system, by way of automatic analysis of file data items, may generate accurate and thorough information regarding the file data items. Automatic analysis of the suspected malware file(s) (also referred to herein as file data item(s)) may include one or more automatic analysis techniques, including, for example, determination of various properties of the file data item, execution of the file data item in a sandbox environment to determine payloads (e.g., files exposed and/or created by execution of the malware, which may be referred to as “payloads,” “drop files,” and/or “dropped file data items”) and behaviors, and/or submission of the file data item to one or more third-party analysis providers, just to name a few. Automatic analysis of a file data item may include production and gathering of various items of information (also referred to herein as “analysis information data items” and/or “analysis information items”) related to the file data item including, for example, calculated hashes, file properties, academic analysis information, file execution information, third-party analysis information, and/or the like. The analysis information items may be automatically associated with the file data item by the system, and a user interface may be generated in which the various analysis information items are presented to the analyst such that a human analyst may quickly and efficiently evaluate the file data item. For example, the analyst may quickly determine one or more characteristics of the file data item, whether or not the file data item is malware, and/or a threat level of the file data item.

In various embodiments, the system may receive suspected malware files from various users. The system may automatically analyze submitted file data items, associate the file data items with analysis information items, and store the file data item and analysis information items in one or more data stores. The system may generate a submission data item with each submission of a file data item, which submission data item may be associated with the submitted file data item. The system may automatically determine whether or not a particular submitted data item was previously submitted to the system and, if so, may associate a new submission data item with the previously submitted file data item. Further, in an embodiment, the system may not re-analyze a previously submitted file data item. Accordingly, in various embodiments, the system may associate a file data item with various submission data items such that information regarding, for example, a number of submissions and/or time of submission may be presented to the analyst. Additionally, information regarding users who submitted the suspected malware files may be associated with the submission file data items, and may be presented to the analyst in connection with the respective file data items.
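
The submission handling described above may be sketched, in a simplified and non-limiting form, as follows. The sketch assumes that a previously submitted file is recognized by comparing SHA-256 hashes, that records are kept in an in-memory dictionary, and that the analysis itself is supplied as a callable; the names `AnalysisStore`, `FileRecord`, `Submission`, and `fake_analyze` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Submission:
    submitter: str
    submitted_at: datetime


@dataclass
class FileRecord:
    sha256: str
    analysis: dict                                   # analysis information items
    submissions: list = field(default_factory=list)  # associated submission data items


class AnalysisStore:
    def __init__(self, analyze):
        self._analyze = analyze  # callable: bytes -> dict of analysis information items
        self._files = {}         # sha256 hex digest -> FileRecord

    def submit(self, data: bytes, submitter: str):
        """Record a submission; analyze the file only if it was not previously received."""
        digest = hashlib.sha256(data).hexdigest()
        record = self._files.get(digest)
        previously_received = record is not None
        if record is None:
            record = FileRecord(digest, self._analyze(data))
            self._files[digest] = record
        # Every submission, new or repeated, produces its own submission data item.
        record.submissions.append(Submission(submitter, datetime.now(timezone.utc)))
        return record, previously_received


# Stand-in analysis that records how many times it was invoked.
calls = []

def fake_analyze(data: bytes) -> dict:
    calls.append(len(data))
    return {"size_bytes": len(data)}


store = AnalysisStore(fake_analyze)
first, seen_first = store.submit(b"sample bytes", "alice")
second, seen_second = store.submit(b"sample bytes", "bob")
```

In this sketch the second submission reuses the stored record, so the analysis runs once while both submitters and submission times remain available for presentation to the analyst.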

In various embodiments, file data items and related information may be shared by the system with one or more third-party systems, and/or third-party systems may share file data items and related information with the system.

In various embodiments, file data items may be submitted and evaluated by a single person (for example, the user submitting the file data item, and the analyst evaluating the results of the system's analysis, may be the same person), or may be submitted by a first person and evaluated by a second person.

In various embodiments, the data analysis system as described herein may be used in conjunction with additional systems and/or components to enable automatic clustering of various data items related to an analyzed file data item. Examples of such clustering, and further analysis of data clusters and generation of associated user interfaces, are described in U.S. patent application Ser. No. 14/473,920, titled “EXTERNAL MALWARE DATA ITEM CLUSTERING AND ANALYSIS,” previously incorporated by reference herein. Accordingly, in various embodiments, the data analysis system may enable automatic, efficient, and effective detection, analysis, and evaluation (by an analyst) of likely malware on computing devices and/or networks.

Description of the Figures

Embodiments of the disclosure will now be described with reference to the accompanying Figures, wherein like numerals refer to like elements throughout. The terminology used in the description presented herein is not intended to be interpreted in any limited or restrictive manner, simply because it is being utilized in conjunction with a detailed description of certain specific embodiments of the disclosure. Furthermore, embodiments of the disclosure described above and/or below may include several novel features, no single one of which is solely responsible for its desirable attributes or which is essential to practicing the embodiments of the disclosure herein described.

FIG. 1 is a flowchart of an example method of the data analysis system, according to an embodiment of the present disclosure. In various embodiments, fewer blocks or additional blocks may be included in the example method of FIG. 1, or various blocks may be performed in an order different from that shown. The functionality of the data analysis system as described in reference to FIG. 1 may be implemented in one or more computer modules and/or processors, as is described below in reference to FIG. 5. For example, in various embodiments, one or more blocks in the flowchart may be performed by one or more components of the data analysis system, for example, computer system 800 and/or various servers 830, as described below in reference to FIG. 5.

In the embodiment of the flowchart of FIG. 1, at block 102 one or more suspected malware files (also referred to herein as “file data items”) are received by the system. The file data items may be submitted to the system, for example, by a user and via a user interface of the system. FIG. 2A illustrates an example user interface 202 of the system by which a user may submit a file data item, according to an embodiment. In the example user interface 202, the user may provide information regarding the submission via fields 204. For example, the user may select the suspected malware file (or files) to submit, may indicate a case name (for example, the user may associate the submission with a particular investigation or case file by providing a case name), may provide a date associated with the submission (that may be, for example, a date on which the file data item is submitted or, alternatively, a date when the file data item was obtained), may indicate a team or name to associate with the file data item, and/or may provide a name of the submitter (for example, the user). The information provided by the user may be associated with the submitted file data item such that the file data item may be associated with other related file data items by the system (for example, other file data items of the same “case,” submitted by the same “team,” submitted within an organization, etc.). Further, the information provided by the user may be included in a further user interface of the system as described below.

The user interface 202 further includes a submit button (or other user interface element) 206 that the user may select to submit the file data item and initiate an automatic analysis of the file data item, as described below. An “add to graph” button may also be provided that, when selected, may cause the system to, before or after analysis, add the file data item to a graph and/or view the file data item and other related data items in a graph or other visualization, similar to the description below in reference to FIGS. 2H and 4, and as described in U.S. patent application Ser. No. 14/473,920, titled “EXTERNAL MALWARE DATA ITEM CLUSTERING AND ANALYSIS.”

In other embodiments, file data items may be automatically received by, or submitted to, the system based on one or more factors, such as when a file data item is stored, accessed, and/or updated on a storage device of the system.

Returning to the flowchart of FIG. 1, when a file data item is received by the system, at block 104 the system determines whether or not the file data item was previously received. Determination of whether a file data item was previously received by the system may enable more efficient operation of the system. For example, as described below, a previously received file may not be re-analyzed, but a previous analysis may be retrieved by the system and presented to the analyst. The system may determine whether a file data item was previously submitted to the system in any of a variety of ways, and/or by any combination of the variety of ways. For example, the system may compute and compare hashes (by, for example, any known hash function) of submitted file data items, may compare file names of submitted file data items, may compare file sizes of submitted file data items, and/or the like. As mentioned, in various embodiments the system may base a determination of whether or not a file data item was previously received on multiple equally or unequally weighted factors.
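The weighted determination of block 104 might be sketched as follows. This is only an illustration under stated assumptions: the names FileRecord, WEIGHTS, THRESHOLD, match_score, and previously_received are hypothetical, as are the particular weights and threshold, and SHA-256 stands in for "any known hash function."

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical record of a submitted file data item."""
    name: str
    data: bytes

    @property
    def sha256(self) -> str:
        return hashlib.sha256(self.data).hexdigest()


# Assumed, unequal integer weights: a hash match dominates name and size matches.
WEIGHTS = {"hash": 8, "size": 1, "name": 1}
THRESHOLD = 8  # assumed minimum score for treating two items as the same file


def match_score(a: FileRecord, b: FileRecord) -> int:
    """Sum the weights of every factor on which the two records agree."""
    score = 0
    if a.sha256 == b.sha256:
        score += WEIGHTS["hash"]
    if len(a.data) == len(b.data):
        score += WEIGHTS["size"]
    if a.name == b.name:
        score += WEIGHTS["name"]
    return score


def previously_received(candidate: FileRecord, known: list) -> bool:
    """True if any known record reaches the threshold against the candidate."""
    return any(match_score(candidate, k) >= THRESHOLD for k in known)
```

With these assumed weights, a renamed copy of a known file is still recognized, while a different file that merely shares a name and size is not.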

If the system determines that the file data item was previously received, at block 116 the system provides a previously determined analysis to the analyst and notifies the analyst that the file data item was previously received (via, for example, a popup window). For example, the system may retrieve a previous analysis of the previously submitted file data item from a data store of the system, and as shown at block 114 (and as described below), provide a user interface to the analyst with the previous analysis information. At block 118, a new submission data item associated with the current submission is generated by the system and associated with the previously submitted file data item. The submission data item may include, for example, the various information provided in the user interface of FIG. 2A. Generation and association of submission data items in connection with each submission by a user enables the system to determine all instances of particular file data items being submitted to the system, associate those instances with the file data item, and present that information to the analyst, as described below.
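The handling of repeat submissions (blocks 104, 116, and 118) could look roughly like the sketch below. It assumes an in-memory store keyed by a SHA-256 hash; FileItem, SubmissionStore, and the analyze callback are hypothetical names introduced only for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FileItem:
    """Hypothetical file data item with its analysis and submission data items."""
    sha256: str
    analysis: dict
    submissions: list = field(default_factory=list)


class SubmissionStore:
    """Illustrative in-memory data store; a real system would persist its data."""

    def __init__(self, analyze):
        self._analyze = analyze  # callable that performs the automatic analysis
        self._items = {}

    def submit(self, data: bytes, submitter: str, case: str):
        """Return (file item, previously_received) for this submission."""
        key = hashlib.sha256(data).hexdigest()
        item = self._items.get(key)
        previously_received = item is not None
        if item is None:
            # First submission: analyze once and store the result.
            item = FileItem(key, self._analyze(data))
            self._items[key] = item
        # Every submission, new or repeated, gets its own submission data item.
        item.submissions.append({"submitter": submitter, "case": case})
        return item, previously_received
```

In this sketch the stored analysis is reused for a repeat submission, and the number of submission records gives the submission count that may be presented to the analyst.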

If the system determines that the file data item was not previously received, the system proceeds with an automatic analysis of the file data item.

At block 106 the system initiates a basic analysis of the received file data item. The basic analysis (also referred to herein as an “internal analysis”) is generally performed by the system and may include various analyses of the received file data item. Examples of the various analyses performed on the received file data item include calculation of hash values 106 a (for example, calculation of MD5, SHA-1, SHA256, and/or the like) of the file data item, calculation of fuzzy hash values 106 b (for example, calculation of SSDeep and/or the like) of the file data item, calculation of other hashes of the file data item, determination of a file size of the file data item (as shown at block 106 c), determination of a file type of the file data item, determination of a file name of the file data item, and/or the like. Any information provided by the basic analysis processes may be referred to herein as basic analysis information items, and such basic analysis information items are associated with the file data item analyzed. In various embodiments, as described below, the basic analysis information may be provided to one or more external analysis services to enable more efficient analysis of the file data item. Further, the basic analysis information may be used by the analyst to evaluate the file data item. For example, the analyst may determine, based on the file size of the file data item, that the file data item is less likely to be a malware file.
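A minimal sketch of part of the basic analysis of block 106, using only Python's standard hashlib module, is shown below. The function name basic_analysis and the dictionary keys are assumptions made for illustration; fuzzy hashing (for example, SSDeep) and file type determination are omitted because they would require libraries outside the standard library.

```python
import hashlib


def basic_analysis(name: str, data: bytes) -> dict:
    """Compute a few basic analysis information items for a file data item."""
    return {
        "file_name": name,
        "file_size": len(data),  # size in bytes (block 106 c)
        "md5": hashlib.md5(data).hexdigest(),      # hash values (block 106 a)
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

The resulting dictionary could be associated with the file data item and, for example, its hash passed to an external analysis provider.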

In an embodiment, when a file data item received by the system is compressed and/or encrypted (for example, contained in a “zip”-type file), the system may automatically un-compress and/or un-encrypt the file data item prior to the basic analysis (such that, for example, the actual file data item of interest may be analyzed, and not a compressed version of the file data item). For example, when necessary the system may request an encryption key from the user upon determination that the file data item is encrypted and/or when the file data item is submitted (for example, via the user interface 202). In an embodiment, the system may automatically determine that a submitted file data item is compressed and/or encrypted. Similarly, the system may automatically un-compress and/or un-encrypt the file data item prior to the external analysis described below. In an embodiment, the file data item may be kept uncompressed and/or unencrypted during both basic analysis and external analysis, and/or during any further analysis. Alternatively, the file data item may be re-compressed and/or re-encrypted (by the same or different algorithms as were used in the initial compression and/or encryption) between the basic analysis and the external analysis such that the file data item may be, for example, safely transmitted to an external service (as described below). In an embodiment, after analyzing the file data item, the system may automatically re-compress and/or re-encrypt (by the same or different algorithms as were used in the initial compression and/or encryption) the file data item prior to storing the file data item.
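As one possible illustration of automatic un-compression, the sketch below uses Python's standard zipfile module to detect a “zip”-type container and extract its members in memory. The function name unpack_if_zip and its return convention are assumptions; the optional password argument covers only the legacy zip encryption that the standard library can read, and other container or encryption formats would need other tools.

```python
import io
import zipfile
from typing import Dict, Optional


def unpack_if_zip(data: bytes, password: Optional[bytes] = None) -> Dict[str, bytes]:
    """Return {member name: contents} for a zip archive, else {"": data}."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        # Not a zip container: analyze the submitted bytes as they are.
        return {"": data}
    with zipfile.ZipFile(buffer) as archive:
        return {
            info.filename: archive.read(info, pwd=password)
            for info in archive.infolist()
            if not info.is_dir()
        }
```

Each extracted member could then be passed to the basic analysis in place of the compressed container.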

FIG. 2B illustrates an example user interface of the system in which an analysis of a file data item is presented, including basic analysis information, according to an embodiment. Such a user interface may be provided, for example, at block 114 of FIG. 1. The user interface of FIG. 2B includes various basic analysis information in boxes 212 and 214. The user interface of FIG. 2B also includes a portion 216 showing various external analysis information, described below. Further, the user interface of FIG. 2B includes various user interface elements (for example, selectable buttons 218) by which an analyst may perform various actions, described below.

Returning to the flowchart of FIG. 1, at block 108 the system initiates an external analysis of the received file data item. The external analysis is generally performed by one or more computing devices external to the system; however, in some embodiments, some or all aspects of the external analysis may be performed by the system. The external analysis of the file data item may include, for example, academic analysis (as shown at block 108 a), execution of the file data item in a sandbox environment (as shown at block 108 b), analysis of the file data item by one or more third-party analysis providers (for example, FireEye, Inc. (block 108 c); VirusTotal (block 108 d), a service provided by Google; and/or the like), aggregation of file data items (and/or other data items) related to the submitted file data item, and/or the like. Any information determined and/or obtained via one or more external analysis processes may be referred to herein as external analysis information items, and such external analysis information items are associated with the file data item analyzed. In an embodiment, the system may automatically provide one or more basic analysis data items to the external analysis providers to enable a more efficient external analysis. For example, the system may provide a hash of the file data item, an encryption key of the file data item (in the example of the file data item being encrypted and/or compressed), and/or the like such that the external analysis provider may quickly identify the file data item, un-compress and/or un-encrypt the file data item, determine whether the file data item was previously analyzed, and/or the like.

As mentioned above, external analysis may include academic analysis 108 a. Academic analysis may include, for example, transmission and/or submission of the file data item to an academic team and/or academic system for analysis, such as a graduate program at a university with a focus on improved malware detection techniques. The academic analysis may include one or more cutting edge analysis techniques, the results of which may be transmitted back to the data analysis system for association with the file data item. The results of the academic analysis may then be presented in a user interface of the system such as the user interface of FIG. 2B.

As also mentioned above, external analysis may include execution of the file data item in a sandbox environment 108 b. A sandbox environment may be a secure computing environment specially designed for execution and analysis of an item of malware. The sandbox is generally walled off from any other computing system so as to prevent damage or infection of any other computing systems by malware when the malware is executed. For example, a sandbox may include a virtual machine executing on a computing system, which has no access to the operating system executing the virtual machine, any data outside of the virtual machine, any networks, etc.

The data analysis system may automatically provide the file data item to such a sandbox environment, which may then execute the file data item. The system may then analyze and record any actions taken or initiated by the file data item upon execution (or such information may be obtained from a sandbox environment external to the system that executes the file data item). For example, the file data item may attempt to contact one or more URLs or domains, may make modifications to files and/or a file system, may make modifications to an operating system registry, may deliver one or more payloads (for example, additional files or programs written to the computing system on which the file data item is executed, and/or executed by the file data item on the computing system on which the file data item is executed), and/or the like. The data analysis system may then record such analysis information, including payloads provided by the file data item, and associate them with the file data item. The portion 216 of the user interface of FIG. 2B illustrates example analysis information items gathered and presented in response to a sandbox analysis. In the user interface, the “indicators” selector 220 is selected, such that various indicators associated with the file data item are shown in the portion 216. For example, the portion 216 in FIG. 2B shows network connections made by the file data item, file system changes, and/or registry changes. Other information may also be shown in the portion 216, and/or other sandbox analysis information may be shown in one or more other user interfaces as described below. In some embodiments, drop files (also referred to herein as “payloads” and/or “dropped file data items”) created by executing the file data item in a sandbox, for example, may be submitted to the system for a same or similar analysis as is discussed in FIG. 1, such as starting with block 104. The analysis information associated with drop files, and/or file data items associated with the drop files, may then be associated with the file data item as described below with various types of data items.

As also mentioned above, external analysis may include transmission and/or submission of the file data item to one or more third-party analysis providers for analysis. Examples of such third-party analysis providers include FireEye (block 108 c) and VirusTotal (block 108 d). The one or more third-party analysis providers may then transmit one or more analysis information items back to the system, where they may be associated with the file data item and displayed to the analyst.

FIGS. 2D-2G illustrate example user interfaces of the system in which third-party provider analyses of a file data item are presented, according to an embodiment. Each of FIGS. 2D-2G illustrates information that may be presented in portion 216 of FIG. 2B when, for example, external analysis selector 236 is selected by the analyst (as shown in FIG. 2D). As indicated in FIG. 2D, selection of VirusTotal button 238 may cause the system to display analysis information gathered as a result of an analysis of the file data item by VirusTotal. The analysis information data items returned from this example third-party analysis provider include, for example, a submission time, a vendor score (which may indicate, for example, a threat level of the file data item as determined by the third-party analysis provider), a name by which the file data item is known among one or more third-party analysis providers and/or other security vendors (for example, “Generic.qx”), and various other information, such as the other various analysis information data items illustrated in display portion 240. In some embodiments, portion 240 may include various selectable buttons, such as Indicators button 241, the selection of which causes the system to display a particular category of analysis information data items, as shown. In an embodiment, in response to receiving particular vendor scores (indicating, for example, threat levels of the file data item) the system may automatically alert the analyst and/or provide different visual indicators (for example, color the user interface or a portion of the user interface with a color corresponding to the threat level). For example, when a vendor score indicates a sufficiently high threat level (as determined, for example, by a comparison to one or more predefined thresholds) the system may automatically alert the analyst via a popup window and/or other notification (for example, an email and/or text message).
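The comparison of a vendor score to predefined thresholds might be sketched as follows. The 0 to 100 score scale, the threshold values, the level names, and the colors are all assumptions chosen for illustration and are not taken from any particular third-party provider.

```python
# Assumed thresholds, highest first: (minimum score, threat level, display color).
THRESHOLDS = [
    (80, "high", "red"),
    (50, "medium", "orange"),
    (0, "low", "green"),
]


def classify_vendor_score(score: float) -> dict:
    """Map an assumed 0-100 vendor score to a threat level, color, and alert flag."""
    for minimum, level, color in THRESHOLDS:
        if score >= minimum:
            return {"level": level, "color": color, "alert": level == "high"}
    raise ValueError("score is below the lowest threshold")
```

A user interface layer could use the returned color for the visual indicator and trigger a popup, email, and/or text message when the alert flag is set.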

FIG. 2E illustrates another user interface of the data analysis system in which the analyst has selected Antivirus Detection button 242 to view antivirus analysis information provided by VirusTotal. FIG. 2F illustrates another user interface of the data analysis system in which the analyst has selected FireEye button 244 to view analysis information provided by an example third-party analysis provider, FireEye. As with FIG. 2D, in FIG. 2F various selectable buttons (such as Alerts button 246) may be provided such that the analyst may view various analysis information. FIG. 2G illustrates another user interface of the data analysis system in which the analyst has selected Network Indicators button 248 to view network indicator analysis information provided by FireEye. Network indicators may include various analysis information items, such as those illustrated in the example of FIG. 2G, including domains, URLs, IP addresses, ports, protocols, etc. associated with execution of the file data item by the selected third-party analysis provider (FireEye in the example of FIG. 2G). In other embodiments, other third-party analysis providers may be used and, thus, user interfaces may be updated to indicate those particular third-party analysis providers. In some embodiments, multiple third-party analyses may be combined, such as by combining threat risk scores from multiple third-party analysis providers into a single, easily understood risk level to be provided to the analyst.
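One way such a combination could work is sketched below: each provider's score is normalized by that provider's maximum score, and the highest normalized value is mapped to a single risk level. Taking the maximum (rather than, say, an average) is an assumption made here for illustration, as are the function names and the cutoff values.

```python
def combined_risk(reports: dict) -> float:
    """reports maps provider name -> (score, max_score); returns a value in [0, 1].

    Assumption: the most severe normalized score determines the combined risk.
    """
    if not reports:
        raise ValueError("at least one provider report is required")
    return max(score / max_score for score, max_score in reports.values())


def risk_level(value: float) -> str:
    """Map a combined risk in [0, 1] to an assumed three-step level."""
    if value >= 0.75:
        return "high"
    if value >= 0.4:
        return "medium"
    return "low"
```

For example, reports of 3 out of 10 from one provider and 40 out of 50 from another would combine to 0.8 under this scheme.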

As also mentioned above, the external analysis block 108 may include gathering of various data items (for example, other file data items) by the system that may be related to the file data item. Examples of such files may include submission data items (for example, as generated each time the file data item has been submitted to the system, as described above and below), other files submitted to the system by users and designated as related to the file data item, payloads gathered from execution of the file data item in a sandbox environment, and/or the like.

FIG. 2C illustrates an example user interface of the system in which related files are displayed, according to an embodiment. As shown, in response to the analyst's selection of Related Items button 230, various related file information is displayed in user interface portions 232 and 234. Portion 232 may display, for example, a list of submission data items associated with the file data item. Each time the file data item is submitted to the system, as described above and below, a new submission data item is created and associated with the file data item. Information regarding those submission data items may be viewed and accessed in portion 232 of the Related Items tab of the example user interface. Portion 234 may display, for example, a list of other file data items (and/or other data items) associated with the submitted file data item. For example, the portion 234 may list data items gathered by the system when the file data item is executed in a sandbox environment. Additionally, the analyst (and/or other user of the system) may manually submit data items to the system via, for example, the “Upload Related Files” button shown in FIG. 2C. When a file data item is submitted to the system in this way, the system again automatically checks whether the file data item was previously submitted, and if so, it may notify the analyst via, for example, a popup window. Further, the submitted file data item is then listed in the portion 234, as the file data item is associated with the originally submitted file data item. In an embodiment, selection of a data item listed in the portions 232 and/or 234 causes the data analysis system to display a user interface (for example, similar to the user interfaces of FIGS. 2B-2G) with details related to the selected data item. In this embodiment, the portion 234 with respect to each of two related file data items would show, in the list, a link to the other file data item.

Returning to the flowchart of FIG. 1, at block 110 the system associates the various analysis information items, such as from one or more internal analyses (e.g., block 106) and/or one or more external analyses (e.g., block 108), with the file data item. Further, at block 112 the system generates a submission data item (for example, related to the submission of the file data item at block 102) and associates the submission data item with the submitted file data item (similar to the description of block 118 provided above).

At block 114 the user interface (for example, the user interface of FIG. 2C) is provided to the analyst such that the analyst may view the various analysis information items and quickly determine one or more characteristics of the file data item, whether or not the file data item is malware, and/or a threat level of the file data item. As mentioned above, the user interface of FIG. 2B includes various selectable buttons 218 by which an analyst may perform various actions to view and investigate information related to an analyzed file data item. For example, an “export” button may be used to export the gathered analysis information items to another file format and/or to another application; an “edit” button may be used to edit information associated with the file data item; a “save” button may be used to commit any changes to the information to a data store of the system; an “add to graph” button may be used, as described above, to add the file data item and/or any related data items to a graph and/or view the file data item and other related data items in a graph or other visualization, as described in reference to FIG. 2H below; an “export malware” button may be used to retrieve the file data item and/or related analysis information from the system (for example, to transfer the file data item to another computing system for further analysis); and/or a “refresh external analysis” button may be used to cause the system to re-run any external analysis on the file data item.

FIG. 2H illustrates an example user interface of the system in which related data items are displayed in a graph 260 (for example, in response to selection of the “add to graph” button of FIG. 2B), according to an embodiment. The graph 260 is structured similarly to graph 1403 described in reference to FIG. 4 below, and accordingly the description of FIG. 4 applies to FIG. 2H, as appropriate. FIG. 2H includes a file data item 262 (for example, a received and analyzed file data item) with links to various related data items. The related data items include a submission data item 264, another submission data item 266 (for example, because the file data item 262 was submitted two times to the system, as indicated and described above in reference to FIG. 2C), a dropped file data item 272 (that was, for example, dropped by the file data item 262 when the file data item 262 was executed in a sandbox), and two data items 268 and 270 representing analysis information items from external analysis of the file data item. Various other data items may be presented in the graph 260 including, for example, related file data items, users associated with submission data items, and/or the like. Accordingly, in various embodiments, a graph user interface such as the graph 260 may enable the analyst to visualize the file data item and associated analysis, and efficiently and quickly determine one or more characteristics of the file data item, whether or not the file data item is malware, and/or a threat level of the file data item. For example, after automatic analysis of the submitted file data item (as described above), the analyst may easily view various analysis information items by viewing one or more of the user interfaces of FIGS. 2A-2H. The analyst may quickly determine, for example, that the file data item was previously submitted multiple times by multiple users, and thus that the file data item is likely a high risk. The analyst may quickly determine, for example, that the file data item makes multiple modifications to a filesystem and registry, and that the types of modifications are likely malicious. Further, for example, the analyst may, based on various analysis information items, have a hunch that the file data item is malicious, and such a hunch may be confirmed by the various external analysis information items gathered by the system and provided to the analyst. Additionally, for example, the analyst may easily determine that a given file data item is related to one or more other file data items that may, for example, contact similar domains. All of these examples, through use of the automatic analysis provided by the system in various embodiments, may be accomplished without manual analysis by the analyst of the file data item. Thus, according to these various examples and the various embodiments of the disclosure described above, the system may enable the analyst to quickly and efficiently evaluate a file data item for suspected malware.

Returning to the flowchart of FIG. 1, at optional blocks 120 and 121, file data items and/or analysis information items associated with file data items of the data analysis system may be shared with various entities (e.g., computing systems or groups of computing systems) within an organization and/or one or more third-party systems, and/or third-party systems may share file data items and related information with the system (for example, for association with one or more file data items). Similarly, sharing may take place between multiple instantiations of the data analysis system as operated by, for example, multiple organizations. Sharing of data may be limited in various ways, for example based on access rules that are determined by the information-providing entity or by a third-party mediator that facilitates sharing, and/or by recipient, by type of recipient, and/or by a type of data shared. In an embodiment, sharing may be facilitated by a third-party system acting as, for example, a mediator. Such a third-party system may facilitate sharing of data item information among various other systems. Examples of sharing of data that may be used in the data analysis system are described in U.S. patent application Ser. No. 14/280,490, previously incorporated by reference herein.
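Limiting sharing by type of recipient and by type of data might, under simple assumptions, be expressed as an access rule that filters what is released. The AccessRule class, the shareable function, and the example type names in the sketch below are hypothetical and are not drawn from the referenced application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRule:
    """Hypothetical rule set by the information-providing entity or a mediator."""
    recipient_types: frozenset  # types of recipient allowed to receive data
    data_types: frozenset       # types of data item allowed to be shared


def shareable(items: list, recipient_type: str, rule: AccessRule) -> list:
    """Return only the items that the rule permits for this type of recipient."""
    if recipient_type not in rule.recipient_types:
        return []
    return [item for item in items if item["type"] in rule.data_types]
```

In this sketch a rule permitting only analysis information to partner organizations would withhold the file data item itself while still releasing, for example, its hashes.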

In an embodiment, the data analysis system encrypts and/or otherwise secures stored file data items such that they may not be executed by the system when not being analyzed and/or outside of a sandbox environment.

In an embodiment, an analyst may add notes and/or tags to a file data item via a user interface of the system. For example, the analyst may, after reviewing the analysis, make a determination regarding the type of malware and/or threat level of the malware of the file data item, and may add notes and/or tags to that effect to be associated with the file data item. In this embodiment, other analysts may then be able to review the notes and/or tags when accessing the file data item. Additionally, the analyst and/or other analysts may be able to determine any previous times a particular malware file has appeared on a network and details about those instances. In an embodiment, a notification of the previous instances a malware file has been found and/or analyzed may be provided to the analyst. In an embodiment, the analyst may mark the file data item as likely malware (or, for example, “malicious”) or not likely malware.

Data Item-Centric Data Model

To provide a framework for the description of specific systems and methods provided above and below, an example database system 1210 using an ontology 1205 will now be described in reference to FIGS. 3A-3B and 4. This description is provided for the purpose of providing an example and is not intended to limit the techniques to the example data model, the example database system, or the example database system's use of an ontology to represent information.

In one embodiment, a body of data is conceptually structured according to a data item-centric data model represented by ontology 1205. The conceptual data model is independent of any particular database used for durably storing one or more database(s) 1209 based on the ontology 1205. For example, each object of the conceptual data model may correspond to one or more rows in a relational database or an entry in a Lightweight Directory Access Protocol (LDAP) database, or any combination of one or more databases.

FIG. 3A illustrates a data item-centric conceptual data model (which may also be referred to as an “object-centric conceptual data model”) according to an embodiment. An ontology 1205, as noted above, may include stored information providing a data model for storage of data in the database 1209. The ontology 1205 may be defined by one or more data item types (which may also be referred to as “object types”), which may each be associated with one or more property types. At the highest level of abstraction, data item 1201 (which may also be referred to as a “data object” or “object”) is a container for information representing things in the world. For example, data item 1201 can represent an entity such as a person, a place, an organization, a market instrument, or other noun. Data item 1201 can represent an event that happens at a point in time or for a duration. Data item 1201 can represent a document or other unstructured data source such as a file (for example, a malware file), an email message, a news report, or a written paper or article. Each data item 1201 is associated with a unique identifier that uniquely identifies the data item within the database system.

Different types of data items may have different property types. For example, a “file” data item (as described above) may have various property types as described above (for example, various hash property types, associated file property types, various external analysis property types, and/or the like), a “Person” data item may have an “Eye Color” property type, and an “Event” data item may have a “Date” property type. Each property 1203 as represented by data in the database system 1210 may have a property type defined by the ontology 1205 used by the database 1209. Properties of data items may, in an embodiment, themselves be data items and/or associated with data items. For example, file data items may be associated with various analysis information items, as described above. Analysis information items may comprise data items and/or properties associated with data items (for example, file data items).

Items may be instantiated in the database 1209 in accordance with the corresponding data item definition for the particular data item in the ontology 1205. For example, a specific monetary payment (e.g., an item of type “event”) of US$30.00 (e.g., a property of type “currency”) taking place on Mar. 27, 2009 (e.g., a property of type “date”) may be stored in the database 1209 as an event data item with associated currency and date properties as defined within the ontology 1205.

The data objects defined in the ontology 1205 may support property multiplicity. In particular, a data item 1201 may be allowed to have more than one property 1203 of the same property type. For example, a “Person” data item may have multiple “Address” properties or multiple “Name” properties.
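The data item, property type, and multiplicity concepts described above can be illustrated with a minimal Python sketch. The names `ONTOLOGY`, `DataItem`, `Property`, `add_property`, and `values` are hypothetical and chosen only for illustration; they are not part of the disclosed system, and an in-memory structure stands in for database 1209.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical ontology: each item type maps to the property types it permits.
ONTOLOGY = {
    "Person": {"Name", "Address", "Eye Color"},
    "Event": {"Date", "Currency"},
}

_ids = count(1)  # source of unique identifiers within this sketch


@dataclass
class Property:
    property_type: str
    value: str


@dataclass
class DataItem:
    item_type: str
    properties: list = field(default_factory=list)
    item_id: int = field(default_factory=lambda: next(_ids))

    def add_property(self, property_type, value):
        # Reject property types the ontology does not define for this item type.
        if property_type not in ONTOLOGY[self.item_type]:
            raise ValueError(f"{property_type!r} is not defined for {self.item_type!r}")
        self.properties.append(Property(property_type, value))

    def values(self, property_type):
        # Multiplicity: several properties may share one property type.
        return [p.value for p in self.properties if p.property_type == property_type]


# The payment example: an event item with currency and date properties.
payment = DataItem("Event")
payment.add_property("Currency", "US$30.00")
payment.add_property("Date", "2009-03-27")

# Property multiplicity: one person item with two address properties.
person = DataItem("Person")
person.add_property("Address", "1 Main St")
person.add_property("Address", "22 Oak Ave")
```

In this sketch the ontology is consulted whenever a property is added, so an item can only be instantiated with property types its item type defines, while a list of properties (rather than a mapping) permits repeated property types.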

Each link 1202 represents a connection between two data items 1201. In one embodiment, the connection is through a relationship, an event, or matching properties. A relationship connection may be asymmetrical or symmetrical. For example, “Person” data item A may be connected to “Person” data item B by a “Child Of” relationship (where “Person” data item B has an asymmetric “Parent Of” relationship to “Person” data item A), a “Kin Of” symmetric relationship to “Person” data item C, and an asymmetric “Member Of” relationship to “Organization” data item X. The type of relationship between two data items may vary depending on the types of the data items. For example, “Person” data item A may have an “Appears In” relationship with “Document” data item Y or have a “Participate In” relationship with “Event” data item E. As an example of an event connection, two “Person” data items may be connected by an “Airline Flight” data item representing a particular airline flight if they traveled together on that flight, or by a “Meeting” data item representing a particular meeting if they both attended that meeting. In one embodiment, when two data items are connected by an event, they are also connected by relationships, in which each data item has a specific relationship to the event, such as, for example, an “Appears In” relationship.

As an example of a matching properties connection, two “Person” data items representing a brother and a sister may both have an “Address” property that indicates where they live. If the brother and the sister live in the same home, then their “Address” properties likely contain similar, if not identical, property values. In one embodiment, a link between two data items may be established based on similar or matching properties (e.g., property types and/or property values) of the data items. These are just some examples of the types of connections that may be represented by a link, and other types of connections may be represented; embodiments are not limited to any particular types of connections between data items. For example, a document might contain references to two different items, such as a payment (one item) and a person (a second item). A link between these two items may represent a connection between these two entities through their co-occurrence within the same document.

Each data item 1201 may have multiple links with another data item 1201 to form a link set 1204. For example, two “Person” data items representing a husband and a wife could be linked through a “Spouse Of” relationship, a matching “Address” property, and one or more matching “Event” properties (e.g., a wedding). Each link 1202 as represented by data in a database may have a link type defined by the database ontology used by the database.
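As a rough illustration of matching-property connections and link sets, the following Python sketch indexes items by property value and groups the resulting links per pair of items. The function name `matching_property_links`, the tuple layout of a link, and the sample data are assumptions made for this example only; exact equality is used here, whereas the description also contemplates similar (non-identical) values.

```python
from collections import defaultdict
from itertools import combinations


def matching_property_links(items):
    """Return {(id_a, id_b): [link, ...]} for items sharing a property value.

    `items` maps an item identifier to a list of (property_type, value) pairs.
    """
    # Index every (property type, value) pair to the items that carry it.
    index = defaultdict(set)
    for item_id, properties in items.items():
        for property_type, value in properties:
            index[(property_type, value)].add(item_id)

    # Each pair of items sharing a value gains one link in its link set.
    link_sets = defaultdict(list)
    for key in sorted(index):
        property_type, _value = key
        for pair in combinations(sorted(index[key]), 2):
            link_sets[pair].append(("Matching Property", property_type))
    return link_sets


items = {
    "husband": [("Address", "1 Main St"), ("Event", "Wedding")],
    "wife": [("Address", "1 Main St"), ("Event", "Wedding")],
    "neighbor": [("Address", "3 Main St")],
}
link_sets = matching_property_links(items)
# An explicit relationship link joins the same link set.
link_sets[("husband", "wife")].append(("Relationship", "Spouse Of"))
```

Here the husband and wife items end up with a link set of three links (two matching properties and one relationship), while the neighbor item, which shares no property value, has no link to either.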

FIG. 3B is a block diagram illustrating example components and data that may be used in identifying and storing data according to an ontology. In this example, the ontology may be configured, and data in the data model populated, by a system of parsers and ontology configuration tools. In the embodiment of FIG. 3B, input data 1300 is provided to parser 1302. The input data may comprise data from one or more sources. For example, an institution may have one or more databases with information on credit card transactions, rental cars, and people. The databases may contain a variety of related information and attributes about each type of data, such as a “date” for a credit card transaction, an address for a person, and a date for when a rental car is rented. The parser 1302 is able to read a variety of source input data types and determine which type of data it is reading.

In accordance with the discussion above, the example ontology 1205 comprises stored information providing the data model of data stored in database 1209, and the ontology is defined by one or more data item types 1310, one or more property types 1316, and one or more link types 1330. Based on information determined by the parser 1302 or other mapping of source input information to item type, one or more data items 1201 may be instantiated in the database 1209 based on respective determined item types 1310, and each of the items 1201 has one or more properties 1203 that are instantiated based on property types 1316. Two data items 1201 may be connected by one or more links 1202 that may be instantiated based on link types 1330. The property types 1316 each may comprise one or more data types 1318, such as a string, number, etc. Property types 1316 may be instantiated based on a base property type 1320. For example, a base property type 1320 may be “Locations” and a property type 1316 may be “Home.”

In an embodiment, a user of the system uses an item type editor 1324 to create and/or modify the item types 1310 and define attributes of the item types. In an embodiment, a user of the system uses a property type editor 1326 to create and/or modify the property types 1316 and define attributes of the property types. In an embodiment, a user of the system uses a link type editor 1328 to create the link types 1330. Alternatively, other programs, processes, or programmatic controls may be used to create link types and property types and define attributes, and using editors is not required.

In an embodiment, creating a property type 1316 using the property type editor 1326 involves defining at least one parser definition using a parser editor 1322. A parser definition comprises metadata that informs parser 1302 how to parse input data 1300 to determine whether values in the input data can be assigned to the property type 1316 that is associated with the parser definition. In an embodiment, each parser definition may comprise a regular expression parser 1304A or a code module parser 1304B. In other embodiments, other kinds of parser definitions may be provided using scripts or other programmatic elements. Once defined, both a regular expression parser 1304A and a code module parser 1304B can provide input to parser 1302 to control parsing of input data 1300.

Using the data types defined in the ontology, input data 1300 may be parsed by the parser 1302 to determine which item type 1310 should receive data from a record created from the input data, and which property types 1316 should be assigned to data from individual field values in the input data. Based on the item/object-property mapping 1301, the parser 1302 selects one of the parser definitions that is associated with a property type in the input data. The parser parses an input data field using the selected parser definition, resulting in creating new or modified data 1303. The new or modified data 1303 is added to the database 1209 according to ontology 1205 by storing values of the new or modified data in a property of the specified property type. As a result, input data 1300 having varying format or syntax can be created in database 1209. The ontology 1205 may be modified at any time using item/object type editor 1324, property type editor 1326, and link type editor 1328, or under program control without human use of an editor. Parser editor 1322 enables creating multiple parser definitions that can successfully parse input data 1300 having varying format or syntax and determine which property types should be used to transform input data 1300 into new or modified data 1303.
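The parsing flow described above, in which a mapping selects a parser definition for each input field and matching values are stored under the corresponding property type, might be sketched as follows. This is a minimal illustration only: `PARSER_DEFINITIONS`, `FIELD_MAPPING`, `parse_record`, the field names, and the regular expressions are all hypothetical, and only the regular-expression style of parser definition is shown.

```python
import re

# Hypothetical regular-expression parser definitions, one per property type.
PARSER_DEFINITIONS = {
    "Date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "Currency": re.compile(r"US\$\d+\.\d{2}"),
    "MD5": re.compile(r"[0-9a-fA-F]{32}"),
}

# Hypothetical item/object-property mapping:
# source field name -> (item type, property type).
FIELD_MAPPING = {
    "txn_date": ("Event", "Date"),
    "amount": ("Event", "Currency"),
    "md5": ("File", "MD5"),
}


def parse_record(record):
    """Return (item_type, {property_type: value}) for fields that parse cleanly."""
    item_type = None
    properties = {}
    for field_name, raw in record.items():
        mapping = FIELD_MAPPING.get(field_name)
        if mapping is None:
            continue  # no mapping for this source field; ignore it
        mapped_item_type, property_type = mapping
        value = raw.strip()
        # Assign the value only if the selected parser definition accepts it.
        if PARSER_DEFINITIONS[property_type].fullmatch(value):
            item_type = item_type or mapped_item_type
            properties[property_type] = value
    return item_type, properties


parsed = parse_record({"txn_date": "2009-03-27", "amount": " US$30.00", "note": "n/a"})
rejected = parse_record({"txn_date": "27/03/2009"})
```

A code module parser could be accommodated in the same sketch by storing a callable instead of a compiled pattern; that variation is omitted here for brevity.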

The properties, data items, and links (e.g., relationships) between the data items can be visualized using a graphical user interface (“GUI”). For example, FIG. 4 displays a user interface showing a graph representation 1403 of relationships (including relationships and/or links 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, and 1413) between the data items (including data items 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, and 1429) that are represented as nodes in the example of FIG. 4. In this embodiment, the data items include person data items 1421, 1422, 1423, 1424, 1425, and 1426; a flight item 1427; a financial account 1428; and a computer data item 1429. In this example, each person node (associated with person data items), flight node (associated with flight data items), financial account node (associated with financial account data items), and computer node (associated with computer data items) may have relationships and/or links with any of the other nodes through, for example, other data items such as payment data items.

For example, in FIG. 4, relationship 1404 is based on a payment associated with the individuals indicated in person data items 1421 and 1423. The link 1404 represents these shared payments (for example, the individual associated with data item 1421 may have paid the individual associated with data item 1423 on three occasions). The relationship is further indicated by the common relationship between person data items 1421 and 1423 and financial account data item 1428. For example, link 1411 indicates that person data item 1421 transferred money into financial account data item 1428, while person data item 1423 transferred money out of financial account data item 1428. In another example, the relationships between person data items 1424 and 1425 and flight data item 1427 are indicated by links 1406, 1409, and 1410. In this example, person data items 1424 and 1425 have a common address and were passengers on the same flight data item 1427. In an embodiment, further details related to the relationships between the various items may be displayed. For example, links 1411 and 1412 may, in some embodiments, indicate the timing of the respective money transfers. In another example, the time of the flight associated with the flight data item 1427 may be shown.

Relationships between data items may be stored as links, or in some embodiments, as properties, where a relationship may be detected between the properties. In some cases, as stated above, the links may be directional. For example, a payment link may have a direction associated with the payment, where one person data item is a receiver of a payment, and another person data item is the payer of the payment.

In various embodiments, data items may further include malware analysis metadata and/or links. Such malware analysis metadata may be accessed by the data analysis system for displaying objects and features on the user interface (as described above).

In addition to visually showing relationships between the data items, the user interface may allow various other manipulations. For example, the data items within database 1108 may be searched using a search interface 1450 (e.g., text string matching of data item properties), inspected (e.g., properties and associated data viewed), filtered (e.g., narrowing the universe of data items into sets and subsets by properties or relationships), and statistically aggregated (e.g., numerically summarized based on summarization criteria), among other operations and visualizations.
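The search, filter, and aggregation operations listed above can be sketched over a small in-memory collection. The sample items, the “Threat Level” property, and the function names `search`, `filter_items`, and `aggregate` are assumptions introduced for this illustration; a real system would run such operations against the database rather than a Python list.

```python
from collections import Counter

# Hypothetical data items, flattened to dictionaries for brevity.
ITEMS = [
    {"type": "File", "File Name": "invoice.exe", "Threat Level": "high"},
    {"type": "File", "File Name": "readme.txt", "Threat Level": "low"},
    {"type": "File", "File Name": "invoice_copy.exe", "Threat Level": "high"},
    {"type": "Person", "Name": "A. Analyst"},
]


def search(items, text):
    # Text string matching against every property value of each item.
    needle = text.lower()
    return [i for i in items if any(needle in str(v).lower() for v in i.values())]


def filter_items(items, criteria):
    # Narrow the set of items to those whose properties equal all criteria.
    return [i for i in items if all(i.get(k) == v for k, v in criteria.items())]


def aggregate(items, property_type):
    # Numerically summarize items by the values of one property type.
    return Counter(i[property_type] for i in items if property_type in i)


invoice_hits = search(ITEMS, "invoice")
high_threat_files = filter_items(ITEMS, {"type": "File", "Threat Level": "high"})
threat_summary = aggregate(ITEMS, "Threat Level")
```

Filter criteria are passed as a dictionary rather than keyword arguments so that property type names containing spaces can be used directly.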

Implementation Mechanisms

According to an embodiment, the data analysis system and other methods and techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, server computer systems, portable computer systems, handheld devices, networking devices or any other device or combination of devices that incorporate hard-wired and/or program logic to implement the techniques.

Computing device(s) are generally controlled and coordinated by operating system software, such as iOS, Android, Chrome OS, Windows XP, Windows Vista, Windows 7, Windows 8, Windows Server, Windows CE, Unix, Linux, SunOS, Solaris, Blackberry OS, VxWorks, or other compatible operating systems. In other embodiments, the computing device may be controlled by a proprietary operating system. Conventional operating systems control and schedule computer processes for execution, perform memory management, provide file system, networking, and I/O services, and provide user interface functionality, such as a graphical user interface (“GUI”), among other things.

For example, FIG. 5 is a block diagram that illustrates a computer system 800 upon which the various systems and methods discussed herein may be implemented. Computer system 800 includes a bus 802 or other communication mechanism for communicating information, and a hardware processor, or multiple processors, 804 coupled with bus 802 for processing information. Hardware processor(s) 804 may be, for example, one or more general purpose microprocessors.

Computer system 800 also includes a main memory 806, such as a random access memory (RAM), cache and/or other dynamic storage devices, coupled to bus 802 for storing information and instructions to be executed by processor 804. Main memory 806 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 804. Such instructions, when stored in storage media accessible to processor 804, render computer system 800 into a special-purpose machine that is customized to perform the operations specified in the instructions.

Computer system 800 further includes a read only memory (ROM) 808 or other static storage device coupled to bus 802 for storing static information and instructions for processor 804. A storage device 810, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), and/or any other suitable data store, is provided and coupled to bus 802 for storing information (for example, file data items, analysis information data items, submission data items, and/or the like) and instructions.

Computer system 800 may be coupled via bus 802 to a display 812, such as a cathode ray tube (CRT), LCD display, or touch screen display, for displaying information to a computer user and/or receiving input from the user. An input device 814, including alphanumeric and other keys, is coupled to bus 802 for communicating information and command selections to processor 804. Another type of user input device is cursor control 816, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 804 and for controlling cursor movement on display 812. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allow the device to specify positions in a plane. In some embodiments, the same direction information and command selections as cursor control may be implemented via receiving touches on a touch screen without a cursor.

Computing system 800 may include a user interface module, and/or various other types of modules to implement one or more graphical user interfaces of the data analysis system, as described above. The modules may be stored in a mass storage device as executable software codes that are executed by the computing device(s). This and other modules may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

In general, the word “module,” as used herein, refers to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, Java, Lua, C or C++. A software module may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software modules may be callable from other modules or from themselves, and/or may be invoked in response to detected events or interrupts. Software modules configured for execution on computing devices may be provided on a computer readable medium, such as a compact disc, digital video disc, flash drive, magnetic disc, or any other tangible medium, or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution). Such software code may be stored, partially or fully, on a memory device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an EPROM. It will be further appreciated that hardware devices (such as processors and CPUs) may be comprised of connected logic units, such as gates and flip-flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors. Generally, the modules described herein refer to logical modules that may be combined with other modules or divided into sub-modules despite their physical organization or storage. In various embodiments, aspects of the methods and systems described herein may be implemented by one or more hardware devices, for example, as logic circuits. In various embodiments, some aspects of the methods and systems described herein may be implemented as software instructions, while others may be implemented in hardware, in any combination.

As mentioned, computer system 800 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 800 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 800 in response to processor(s) 804 executing one or more sequences of one or more modules and/or instructions contained in main memory 806. Such instructions may be read into main memory 806 from another storage medium, such as storage device 810. Execution of the sequences of instructions contained in main memory 806 causes processor(s) 804 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.

The term “non-transitory media,” and similar terms, as used herein refer to any media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 810. Volatile media includes dynamic memory, such as main memory 806. Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.

Non-transitory media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between non-transitory media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 802. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 804 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions and/or modules into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 800 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 802. Bus 802 carries the data to main memory 806, from which processor 804 retrieves and executes the instructions. The instructions received by main memory 806 may optionally be stored on storage device 810 either before or after execution by processor 804.

Computer system 800 also includes a communication interface 818 coupled to bus 802. Communication interface 818 provides a two-way data communication coupling to a network link 820 that is connected to a local network 822. For example, communication interface 818 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 818 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or a WAN component to communicate with a WAN). Wireless links may also be implemented. In any such implementation, communication interface 818 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

Network link 820 typically provides data communication through one or more networks to other data devices. For example, network link 820 may provide a connection through local network 822 to a host computer 824 or to data equipment operated by an Internet Service Provider (ISP) 826. ISP 826 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the “Internet” 828. Local network 822 and Internet 828 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 820 and through communication interface 818, which carry the digital data to and from computer system 800, are example forms of transmission media.

Computer system 800 can send messages and receive data, including program code, through the network(s), network link 820 and communication interface 818. In the Internet example, a server 830 might transmit a requested code for an application program through Internet 828, ISP 826, local network 822 and communication interface 818. For example, in an embodiment various aspects of the data analysis system may be implemented on one or more of the servers 830 and may be transmitted to and from the computer system 800. For example, submitted malware data items may be transmitted to one of the servers 830, aspects of the basic analysis may be implemented on one or more of the servers 830, and/or aspects of the external analysis may be implemented on one or more of the servers 830. In an example, requests for external analyses of file data items may be transmitted to one or more third-party servers 830 (from, for example, the computer system 800 and/or another server 830 of the system), and analysis data may then be transmitted back from third-party servers 830.

In an embodiment, the data analysis system may be accessible by the user through a web-based viewer, such as a web browser. In this embodiment, the user interfaces of the system may be generated by a server (such as one of the servers 830) and/or the computer system 800 and transmitted to the web browser of the user. The user may then interact with the user interfaces through the web browser. In an embodiment, the computer system 800 may comprise a mobile electronic device, such as a cell phone, smartphone, and/or tablet. The system may be accessible by the user through such a mobile electronic device, among other types of electronic devices.

Additional Embodiments

While the foregoing is directed to various embodiments, other and further embodiments may be devised without departing from the basic scope thereof. For example, aspects of the present disclosure may be implemented in hardware or software or in a combination of hardware and software. An embodiment of the disclosure may be implemented as a program product for use with a computer system. The program(s) of the program product define functions of the embodiments (including the methods described herein) and may be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., hard-disk drive or any type of solid-state random-access semiconductor memory) on which alterable information is stored. Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code modules executed by one or more computer systems or computer processors comprising computer hardware. The processes and algorithms may alternatively be implemented partially or wholly in application-specific circuitry.

The various features and processes described above may be used independently of one another, or may be combined in various ways. All possible combinations and subcombinations are intended to fall within the scope of this disclosure. In addition, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The example blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed example embodiments. The example systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the disclosed example embodiments.

Conditional language, such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.

The term “comprising” as used herein should be given an inclusive rather than exclusive interpretation. For example, a general purpose computer comprising one or more processors should not be interpreted as excluding other computer components, and may possibly include such components as memory, input/output devices, and/or network interfaces, among others.

Any process descriptions, elements, or blocks in the flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process. Alternate implementations are included within the scope of the embodiments described herein in which elements or functions may be deleted, executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art.

It should be emphasized that many variations and modifications may be made to the above-described embodiments, the elements of which are among other acceptable examples. All such modifications and variations are intended to be included herein within the scope of this disclosure. The foregoing description details certain embodiments of the invention. It will be appreciated, however, that no matter how detailed the foregoing appears in text, the invention may be practiced in many ways. As is also stated above, it should be noted that the use of particular terminology when describing certain features or aspects of the invention should not be taken to imply that the terminology is being re-defined herein to be restricted to including any specific characteristics of the features or aspects of the invention with which that terminology is associated. The scope of the invention should therefore be construed in accordance with the appended claims and any equivalents thereof.

What is claimed is:
 1. A computer system comprising: a database configured to store file data items; and one or more hardware computer processors configured to execute computer executable instructions in order to: receive a first data item including a suspected malware file; store, in the database, the first data item in association with at least one of: a date of submission of the first data item, or an identifier of the person who submitted the first data item; initiate an internal analysis of the first data item to generate an internal analysis information item; transmit the first data item to an external analysis provider outside of the computer system for external analysis; receive, from the external analysis provider, an external analysis information item; and generate a graphical user interface presenting analysis information items associated with the first data item, the graphical user interface including at least: a first node representing the first data item, and a second node representing the internal analysis information item.
 2. The computer system of claim 1, wherein the graphical user interface further includes a third node representing the external analysis information item, wherein the first, second, and third nodes are linked by edges in a graph or web.
 3. The computer system of claim 1, wherein the one or more hardware computer processors are further configured to execute computer executable instructions in order to: search the database for previously submitted data items matching the first data item; and generate a displayable notification indicating that the first data item was previously submitted.
 4. The computer system of claim 3, wherein the displayable notification indicates the date that the first data item was previously submitted.
 5. The computer system of claim 3, wherein the displayable notification indicates an identifier of the person who submitted the first data item.
 6. The computer system of claim 1, wherein nodes in the graphical user interface are user selectable icons.
 7. The computer system of claim 1, wherein the one or more hardware computer processors are further configured to execute computer executable instructions in order to: receive a submission of a second data item representing a suspected malware file; and generate a fourth node in the graphical user interface, the fourth node indicating the submission of the second data item.
 8. The computer system of claim 7, wherein the one or more hardware computer processors are further configured to execute computer executable instructions in order to: compare an analysis information item of the second data item to at least one of the internal analysis information item or the external analysis information item; determine that the second data item and the first data item match; and in response to determining that the second data item and the first data item match, associate a second submission event with the first data item.
 9. The computer system of claim 8, wherein comparing the analysis information item of the second data item to at least one of the internal analysis information item or the external analysis information item includes: calculating a hash of the second data item; and comparing the calculated hash to a previously calculated hash of the first data item.
 10. The computer system of claim 7, the graphical user interface including at least: the first node representing the first data item, the second node representing the internal analysis information item, a third node representing the submission of the first data item, and a fourth node representing a submission of the second data item.
 11. The computer system of claim 10, the graphical user interface further including at least: a fifth node representing the external analysis information item.
 12. The computer system of claim 11, wherein the graphical visualization further includes edges linking the first node to the second, third, fourth, and fifth nodes.
 13. The computer system of claim 7, wherein the one or more hardware computer processors are further configured to execute computer executable instructions in order to: receive a submission of a third data item, the third data item representing another suspected malware file; compare the third data item with at least one of the first data item or the second data item; determine that the third data item and at least one of the first data item or the second data item match; generate a fifth node in the graphical user interface, the fifth node indicating the submission of the third data item and linked to at least one of the first or third node; and provide a notification that the third data item was previously received.
 14. The computer system of claim 1, wherein the internal analysis includes at least calculation of a hash of the data item.
 15. The computer system of claim 14, wherein the hash is at least one of an MD5 hash of the first data item, a SHA-1 hash of the first data item, a SHA-256 hash of the first data item, an SSDeep hash of the first data item, or a size of the first data item.
 16. The computer system of claim 15, wherein the external analysis includes analysis performed by at least a second computer system, and wherein the external analysis includes execution of the first data item in a sandboxed environment and analysis of the first data item by a third-party malware analysis service.
 17. The computer system of claim 16, wherein any payload provided by the first data item after execution of the first data item in the sandboxed environment is indicated as a node in the graphical user interface.
 18. The computer system of claim 1, wherein the one or more hardware computer processors are further configured to execute computer executable instructions in order to: share the first data item and associated analysis information items with a second computer system via a third computer system.